optimize threading of mha #20088

Merged: 4 commits merged into main from yufeng/mha_opt on Apr 2, 2024
Conversation

yufenglee (Member) commented Mar 26, 2024

Description

The cost computation of ComputeVxAttentionScore is wrong. It should be sequence_length * v_head_size * total_sequence_length instead of sequence_length * v_head_size * sequence_length.

The PR also fine-tuned the cost computation.

On my local box with an i9 CPU, the performance is the same as the unfused version, but it is much faster on an Azure VM with 16 threads.
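For illustration, here is a minimal standalone sketch of the corrected per-unit cost (not the exact PR diff; the local `TensorOpCost` struct only mirrors the shape of ONNX Runtime's cost type, and the parallel unit is assumed to be one (batch, head) slice):

```cpp
#include <cstddef>

// Sketch only: mirrors the shape of ORT's TensorOpCost (bytes loaded/stored
// and compute cycles per parallel unit); not the actual header.
struct TensorOpCost {
  double bytes_loaded;
  double bytes_stored;
  double compute_cycles;
};

// Cost of ComputeVxAttentionScore for one (batch, head) slice:
// probs [sequence_length x total_sequence_length] times
// V     [total_sequence_length x v_head_size].
TensorOpCost VxScoreCost(std::ptrdiff_t sequence_length,
                         std::ptrdiff_t total_sequence_length,
                         std::ptrdiff_t v_head_size,
                         std::ptrdiff_t element_size) {
  TensorOpCost cost;
  // Read the probability matrix and the concatenated (past + present) V.
  cost.bytes_loaded = static_cast<double>(
      (sequence_length * total_sequence_length +
       total_sequence_length * v_head_size) * element_size);
  // Write the [sequence_length x v_head_size] output slice.
  cost.bytes_stored =
      static_cast<double>(sequence_length * v_head_size * element_size);
  // The fix: multiply-accumulates scale with total_sequence_length,
  // i.e. sequence_length * v_head_size * total_sequence_length,
  // not sequence_length * v_head_size * sequence_length.
  cost.compute_cycles = static_cast<double>(
      sequence_length * v_head_size * total_sequence_length);
  return cost;
}
```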

Motivation and Context

#19924

BowenBao (Contributor)

Thanks @yufenglee for the PR! Do you have before and after perf numbers for the repro in issue #19924?

yufenglee (Member, Author)

> Thanks @yufenglee for the PR! Do you have before and after perf numbers for the repro in issue #19924?

On my local box, it takes ~1.5 ms before and ~0.5 ms after.

tianleiwu (Contributor) commented Mar 26, 2024

We may need to take a look at the cost-model approach and see why the cost model cannot work properly, since it is fundamental for the CPU EP. Maybe try using the correct cost (like adding the concat KV cost, etc.) to see whether that could resolve the issue as well.
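What that suggestion might look like, reusing the `TensorOpCost` sketch above (a hypothetical helper, not code from this PR): fold the work of concatenating past and present K/V into the per-unit cost so the thread pool sees the full amount of work.

```cpp
// Hypothetical: add the bytes moved when past and present K/V are copied
// into the concatenated buffer (both K and V, hence the factor of 2).
TensorOpCost WithConcatKvCost(TensorOpCost cost,
                              std::ptrdiff_t total_sequence_length,
                              std::ptrdiff_t head_size,
                              std::ptrdiff_t element_size) {
  const double concat_bytes = static_cast<double>(
      2 * total_sequence_length * head_size * element_size);
  cost.bytes_loaded += concat_bytes;  // read past + present K/V
  cost.bytes_stored += concat_bytes;  // write the concatenated buffer
  return cost;
}
```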

tianleiwu previously approved these changes Mar 26, 2024
tianleiwu (Contributor) commented Mar 26, 2024

Please run benchmarks comparing before/after (a sketch of sweeping the thread count follows this list):

  • BERT base, batch size 1/2/4, sequence lengths 16/64/128/256/512, with intra_op_num_threads = 1, 2, 4, 8
  • GPT-2 decoding, batch size 1/2/4, past sequence lengths 16/64/128/256/512, with intra_op_num_threads = 1, 2, 4, 8

Just in case: in some situations, the cost model is better.
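A minimal sketch of the sweep's thread-count knob, assuming the public Ort C++ API (model path, inputs, and timing are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "mha_bench");
  for (int threads : {1, 2, 4, 8}) {
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(threads);             // the knob swept above
    Ort::Session session(env, "model.onnx", opts);  // placeholder model path
    // ... build inputs for each batch size / sequence length in the grid,
    //     call session.Run(...) repeatedly, and record latencies ...
  }
  return 0;
}
```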

yihonglyu (Contributor)

> Description
>
> The CostModel of threading doesn't work for attention, especially for the decoding case. It leads to fewer threads being used to compute the attention. Each batch * num_of_head slice is sufficient to serve as one unit. Changed to use TrySimpleParallelFor.
>
> Motivation and Context
>
> #19924

Would changing loop_len to batch_size * num_heads_ * sequence_length solve the issue?
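Roughly what that alternative looks like (a sketch assuming ORT's internal concurrency::ThreadPool API; the attention body is elided):

```cpp
#include "core/platform/threadpool.h"  // onnxruntime::concurrency::ThreadPool

// Sketch: one parallel unit per (batch, head), split evenly across threads
// by TrySimpleParallelFor instead of letting the cost model pick the split.
void ParallelOverHeads(onnxruntime::concurrency::ThreadPool* tp,
                       std::ptrdiff_t batch_size,
                       std::ptrdiff_t num_heads) {
  onnxruntime::concurrency::ThreadPool::TrySimpleParallelFor(
      tp, batch_size * num_heads, [&](std::ptrdiff_t i) {
        const std::ptrdiff_t batch = i / num_heads;
        const std::ptrdiff_t head = i % num_heads;
        // ... compute attention for this (batch, head) slice ...
        (void)batch;
        (void)head;
      });
}
```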

yufenglee (Member, Author)

The cost computation of ComputeVxAttentionScore is wrong. It should be sequence_length * v_head_size * total_sequence_length instead of sequence_length * v_head_size * sequence_length.

Also fine-tuned the cost computation for data load and store.

yufenglee merged commit 9165498 into main on Apr 2, 2024
95 checks passed
yufenglee deleted the yufeng/mha_opt branch on April 2, 2024 at 04:32
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024